In fine art, especially painting, humans have mastered the skill to create unique
visual experiences through composing a complex interplay between the content
and style of an image. Thus far the algorithmic basis of this process is
unknown and there exists no artificial system with similar capabilities. However,
in other key areas of visual perception such as object and face recognition,
near-human performance was recently demonstrated by a class of biologically
inspired vision models called Deep Neural Networks. Here we introduce an
artificial system based on a Deep Neural Network that creates artistic images
of high perceptual quality. The system uses neural representations to separate
and recombine content and style of arbitrary images, providing a neural
algorithm for the creation of artistic images. Moreover, in light of the striking
similarities between performance-optimised artificial neural networks and
biological vision, our work offers a path forward to an algorithmic understanding
of how humans create and perceive artistic imagery.

The Deep Neural Networks that are most powerful in image processing tasks are
called Convolutional Neural Networks. Convolutional Neural Networks consist of layers of
small computational units that process visual information hierarchically in a feed-forward
manner (Fig. 1). Each layer of units can be understood as a collection of image filters, each of
which extracts a certain feature from the input image. Thus, the output of a given layer consists
of so-called feature maps: differently filtered versions of the input image.

When Convolutional Neural Networks are trained on object recognition, they develop a
representation of the image that makes object information increasingly explicit along the
processing hierarchy. Therefore, along the processing hierarchy of the network, the input image
is transformed into representations that increasingly care about the actual content of the
image compared to its detailed pixel values. We can directly visualise the information each layer
contains about the input image by reconstructing the image only from the feature maps in that
layer (Fig. 1, content reconstructions; see Methods for details on how to reconstruct the
image). Higher layers in the network capture the high-level content in terms of objects and their
arrangement in the input image but do not constrain the exact pixel values of the
reconstruction (Fig. 1, content reconstructions d,e). In contrast, reconstructions from the lower
layers simply reproduce the exact pixel values of the original image (Fig. 1, content
reconstructions a,b,c). We therefore refer to the feature responses in higher layers of the network
as the content representation.

To obtain a representation of the style of an input image, we use a feature space originally 
designed to capture texture information. This feature space is built on top of the filter responses
in each layer of the network. It consists of the correlations between the different filter responses 
over the spatial extent of the feature maps (see Methods for details). By including the feature 
correlations of multiple layers, we obtain a stationary, multi-scale representation of the input 
image, which captures its texture information but not the global arrangement. 

Figure 1: Convolutional Neural Network (CNN). A given input image is represented as a set
of filtered images at each processing stage in the CNN. While the number of different filters
increases along the processing hierarchy, the size of the filtered images is reduced by some
downsampling mechanism (e.g. max-pooling), leading to a decrease in the total number of
units per layer of the network. Content Reconstructions. We can visualise the information
at different processing stages in the CNN by reconstructing the input image from only
knowing the network’s responses in a particular layer. We reconstruct the input image from
layers ‘conv1_1’ (a), ‘conv2_1’ (b), ‘conv3_1’ (c), ‘conv4_1’ (d) and ‘conv5_1’ (e) of the
original VGG-Network. We find that reconstruction from lower layers is almost perfect (a,b,c). In
higher layers of the network, detailed pixel information is lost while the high-level content of the
image is preserved (d,e). Style Reconstructions. On top of the original CNN representations
we built a new feature space that captures the style of an input image. The style representation
computes correlations between the different features in different layers of the CNN. We
reconstruct the style of the input image from style representations built on different subsets of CNN
layers (‘conv1_1’ (a), ‘conv1_1’ and ‘conv2_1’ (b), ‘conv1_1’, ‘conv2_1’ and ‘conv3_1’ (c),
‘conv1_1’, ‘conv2_1’, ‘conv3_1’ and ‘conv4_1’ (d), ‘conv1_1’, ‘conv2_1’, ‘conv3_1’, ‘conv4_1’
and ‘conv5_1’ (e)). This creates images that match the style of a given image on an increasing
scale while discarding information of the global arrangement of the scene.

Again, we can visualise the information captured by these style feature spaces built on
different layers of the network by constructing an image that matches the style representation
of a given input image (Fig. 1, style reconstructions). Indeed, reconstructions from the style
features produce texturised versions of the input image that capture its general appearance in
terms of colour and localised structures. Moreover, the size and complexity of local image
structures from the input image increase along the hierarchy, a result that can be explained
by the increasing receptive field sizes and feature complexity. We refer to this multi-scale
representation as style representation.

The key finding of this paper is that the representations of content and style in the
Convolutional Neural Network are separable. That is, we can manipulate both representations
independently to produce new, perceptually meaningful images. To demonstrate this finding, we
generate images that mix the content and style representation from two different source images.
In particular, we match the content representation of a photograph depicting the “Neckarfront”
in Tübingen, Germany and the style representations of several well-known artworks taken from
different periods of art (Fig. 2).

The images are synthesised by finding an image that simultaneously matches the content 
representation of the photograph and the style representation of the respective piece of art (see 
Methods for details). While the global arrangement of the original photograph is preserved, 
the colours and local structures that compose the global scenery are provided by the artwork. 
Effectively, this renders the photograph in the style of the artwork, such that the appearance of 
the synthesised image resembles the work of art, even though it shows the same content as the 
photograph. 

As outlined above, the style representation is a multi-scale representation that includes
multiple layers of the neural network. In the images we have shown in Fig. 2 the style representation
included layers from the whole network hierarchy. Style can also be defined more locally by
including only a smaller number of lower layers, leading to different visual experiences (Fig. 3,
along the rows). When matching the style representations up to higher layers in the network,
local image structures are matched on an increasingly large scale, leading to a smoother and
more continuous visual experience. Thus, the visually most appealing images are usually
created by matching the style representation up to the highest layers in the network (Fig. 3, last
row).

Figure 2: Images that combine the content of a photograph with the style of several well-known
artworks. The images were created by finding an image that simultaneously matches the content
representation of the photograph and the style representation of the artwork (see Methods). The
original photograph depicting the Neckarfront in Tübingen, Germany, is shown in A (Photo:
Andreas Praefcke). The painting that provided the style for the respective generated image
is shown in the bottom left corner of each panel. B The Shipwreck of the Minotaur by J.M.W.
Turner, 1805. C The Starry Night by Vincent van Gogh, 1889. D Der Schrei by Edvard Munch,
1893. E Femme nue assise by Pablo Picasso, 1910. F Composition VII by Wassily Kandinsky,
1913.

Of course, image content and style cannot be completely disentangled. When synthesising
an image that combines the content of one image with the style of another, there usually does
not exist an image that perfectly matches both constraints at the same time. However, the
loss function we minimise during image synthesis contains two terms for content and style
respectively, that are well separated (see Methods). We can therefore smoothly regulate the
emphasis on either reconstructing the content or the style (Fig. 3, along the columns). A strong
emphasis on style will result in images that match the appearance of the artwork, effectively
giving a texturised version of it, but hardly show any of the photograph’s content (Fig. 3, first
column). When placing strong emphasis on content, one can clearly identify the photograph,
but the style of the painting is not as well-matched (Fig. 3, last column). For a specific pair of
source images one can adjust the trade-off between content and style to create visually appealing
images.

Here we present an artificial neural system that achieves a separation of image content from
style, thus allowing us to recast the content of one image in the style of any other image. We
demonstrate this by creating new, artistic images that combine the style of several well-known
paintings with the content of an arbitrarily chosen photograph. In particular, we derive the
neural representations for the content and style of an image from the feature responses of
high-performing Deep Neural Networks trained on object recognition. To our knowledge this is the
first demonstration of image features separating content from style in whole natural images.

Figure 3: Detailed results for the style of the painting Composition VII by Wassily Kandinsky.
The rows show the result of matching the style representation of increasing subsets of the CNN
layers (see Methods). We find that the local image structures captured by the style
representation increase in size and complexity when including style features from higher layers of the
network. This can be explained by the increasing receptive field sizes and feature complexity
along the network’s processing hierarchy. The columns show different relative weightings
between the content and style reconstruction. The number above each column indicates the
ratio $\alpha/\beta$ between the emphasis on matching the content of the photograph and the style of the
artwork (see Methods).

Previous work on separating content from style was evaluated on sensory inputs of much lesser
complexity, such as characters in different handwriting or images of faces or small figures in 
different poses. 

In our demonstration, we render a given photograph in the style of a range of well-known
artworks. This problem is usually approached in a branch of computer vision called
non-photorealistic rendering. Conceptually most closely related are methods
using texture transfer to achieve artistic style transfer. However, these previous
approaches mainly rely on non-parametric techniques to directly manipulate the pixel
representation of an image. In contrast, by using Deep Neural Networks trained on object recognition, we
carry out manipulations in feature spaces that explicitly represent the high level content of an
image.

Features from Deep Neural Networks trained on object recognition have been previously 
used for style recognition in order to classify artworks according to the period in which they 
were created. There, classifiers are trained on top of the raw network activations, which we
call content representations. We conjecture that a transformation into a stationary feature space 
such as our style representation might achieve even better performance in style classification. 

In general, our method of synthesising images that mix content and style from different
sources provides a new, fascinating tool to study the perception and neural representation of
art, style and content-independent image appearance. We can design novel stimuli
that introduce two independent, perceptually meaningful sources of variation: the appearance
and the content of an image. We envision that this will be useful for a wide range of
experimental studies concerning visual perception ranging from psychophysics over functional imaging
to even electrophysiological neural recordings. In fact, our work offers an algorithmic
understanding of how neural representations can independently capture the content of an image and
the style in which it is presented. Importantly, the mathematical form of our style
representations generates a clear, testable hypothesis about the representation of image appearance down
to the single neuron level. The style representations simply compute the correlations between
different types of neurons in the network. Extracting correlations between neurons is a
biologically plausible computation that is, for example, implemented by so-called complex cells
in the primary visual system (V1). Our results suggest that performing a complex-cell-like
computation at different processing stages along the ventral stream would be a possible way to
obtain a content-independent representation of the appearance of a visual input.

All in all it is truly fascinating that a neural system, which is trained to perform one of the 
core computational tasks of biological vision, automatically learns image representations that 
allow the separation of image content from style. The explanation could be that when learning 
object recognition, the network has to become invariant to all image variation that preserves 
object identity. Representations that factorise the variation in the content of an image and the 
variation in its appearance would be extremely practical for this task. Thus, our ability to 
abstract content from style and therefore our ability to create and enjoy art might be primarily a 
preeminent signature of the powerful inference capabilities of our visual system. 

Methods 

The results presented in the main text were generated on the basis of the VGG-Network,
a Convolutional Neural Network that rivals human performance on a common visual object
recognition benchmark task and has been introduced and extensively described elsewhere. We used
the feature space provided by the 16 convolutional and 5 pooling layers of the 19 layer
VGG-Network. We do not use any of the fully connected layers. The model is publicly available and
can be explored in the Caffe framework. For image synthesis we found that replacing the
max-pooling operation by average pooling improves the gradient flow and one obtains slightly
more appealing results, which is why the images shown were generated with average pooling.
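
As a rough illustration of the pooling swap described above, the following NumPy sketch compares non-overlapping 2x2 max pooling with average pooling on a single feature map. The function names `pool2x2` and `pool2x2_input_grad` are hypothetical and are not taken from the VGG or Caffe code; the 2x2 window with stride 2 and the even map size are assumptions made for the example. It only shows one plausible reading of "improves the gradient flow": max pooling passes gradient to a single position per window, whereas average pooling spreads it over all four.

```python
import numpy as np


def _blocks(fmap):
    """Reshape an (H, W) map with even H and W into non-overlapping 2x2 windows."""
    h, w = fmap.shape
    if h % 2 or w % 2:
        raise ValueError("this sketch assumes even height and width")
    return fmap.reshape(h // 2, 2, w // 2, 2)


def pool2x2(fmap, mode="avg"):
    """Downsample a feature map by 2x2 max or average pooling (stride 2)."""
    blocks = _blocks(np.asarray(fmap, dtype=float))
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "avg":
        return blocks.mean(axis=(1, 3))
    raise ValueError("mode must be 'max' or 'avg'")


def pool2x2_input_grad(fmap, mode="avg"):
    """Gradient of sum(pool2x2(fmap, mode)) with respect to each input position."""
    fmap = np.asarray(fmap, dtype=float)
    blocks = _blocks(fmap)
    if mode == "avg":
        # Every input position contributes 1/4 to exactly one output.
        return np.full_like(fmap, 0.25)
    if mode == "max":
        # Only the maximum of each window is marked (ties mark every tied position).
        window_max = blocks.max(axis=(1, 3), keepdims=True)
        return (blocks == window_max).astype(float).reshape(fmap.shape)
    raise ValueError("mode must be 'max' or 'avg'")


fmap = np.arange(16).reshape(4, 4)
print(pool2x2(fmap, "max"))
print(pool2x2(fmap, "avg"))
print(np.count_nonzero(pool2x2_input_grad(fmap, "max")),
      np.count_nonzero(pool2x2_input_grad(fmap, "avg")))
```
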

Generally each layer in the network defines a non-linear filter bank whose complexity
increases with the position of the layer in the network. Hence a given input image $\vec{x}$ is encoded
in each layer of the CNN by the filter responses to that image. A layer with $N_l$ distinct filters
has $N_l$ feature maps each of size $M_l$, where $M_l$ is the height times the width of the feature map.
So the responses in a layer $l$ can be stored in a matrix $F^l \in \mathbb{R}^{N_l \times M_l}$ where $F^l_{ij}$ is the
activation of the $i$-th filter at position $j$ in layer $l$. To visualise the image information that is
encoded at different layers of the hierarchy (Fig. 1, content reconstructions) we perform gradient descent
on a white noise image to find another image that matches the feature responses of the original
image. So let $\vec{p}$ and $\vec{x}$ be the original image and the image that is generated and $P^l$ and $F^l$ their
respective feature representation in layer $l$. We then define the squared-error loss between the
two feature representations

$$\mathcal{L}_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i,j} \left( F^l_{ij} - P^l_{ij} \right)^2 . \tag{1}$$
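
The feature matrix and the loss in Eq. (1) can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used for the results: the helper names `feature_matrix` and `content_loss` are hypothetical, and the `(N_l, height, width)` array layout of a layer's feature maps is an assumption.

```python
import numpy as np


def feature_matrix(feature_maps):
    """Flatten (N_l, height, width) feature maps into the N_l x M_l matrix F^l."""
    feature_maps = np.asarray(feature_maps, dtype=float)
    n_filters = feature_maps.shape[0]
    return feature_maps.reshape(n_filters, -1)


def content_loss(F, P):
    """Eq. (1): half the summed squared difference between F^l and P^l."""
    return 0.5 * np.sum((F - P) ** 2)


# Three hypothetical 2x2 feature maps give N_l = 3 and M_l = 4.
F = feature_matrix(np.ones((3, 2, 2)))
P = np.zeros_like(F)
print(F.shape, content_loss(F, P))
```
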

The derivative of this loss with respect to the activations in layer $l$ equals

$$\frac{\partial \mathcal{L}_{content}}{\partial F^l_{ij}} = \begin{cases} \left( F^l - P^l \right)_{ij} & \text{if } F^l_{ij} > 0 \\ 0 & \text{if } F^l_{ij} < 0 , \end{cases} \tag{2}$$

from which the gradient with respect to the image $\vec{x}$ can be computed using standard error
back-propagation. Thus we can change the initially random image $\vec{x}$ until it generates the same
response in a certain layer of the CNN as the original image $\vec{p}$. The five content reconstructions
in Fig. 1 are from layers ‘conv1_1’ (a), ‘conv2_1’ (b), ‘conv3_1’ (c), ‘conv4_1’ (d) and ‘conv5_1’
(e) of the original VGG-Network.
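
To make the reconstruction procedure concrete, the sketch below applies Eq. (2) and one step of back-propagation to a toy stand-in for the network. Everything about the toy is an assumption for illustration: a single layer of random linear filters `W` followed by a ReLU replaces the VGG layers, the "image" is a short vector, and the names `toy_layer`, `content_grad_wrt_image` and `reconstruct`, as well as the step size and iteration count, are hypothetical.

```python
import numpy as np


def toy_layer(W, x):
    """Hypothetical one-layer 'network': linear filters W followed by a ReLU."""
    return np.maximum(W @ x, 0.0)


def content_loss(F, P):
    """Eq. (1) for a single layer."""
    return 0.5 * np.sum((F - P) ** 2)


def content_grad_wrt_image(W, x, P):
    """Back-propagate Eq. (2) through the toy layer to the image x."""
    F = toy_layer(W, x)
    dF = np.where(F > 0, F - P, 0.0)  # Eq. (2): zero where the unit is inactive
    return W.T @ dF                   # chain rule through the linear filters


def reconstruct(W, P, x0, step=0.01, n_steps=500):
    """Plain gradient descent on the image, starting from the noise image x0."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x -= step * content_grad_wrt_image(W, x, P)
    return x


rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))        # six toy filters on a four-pixel "image"
p = rng.normal(size=4)             # original image
P = toy_layer(W, p)                # its feature responses
x0 = rng.normal(size=4)            # white-noise starting point
x = reconstruct(W, P, x0)
print(content_loss(toy_layer(W, x0), P), content_loss(toy_layer(W, x), P))
```

In the real setting the gradient with respect to the image passes through every layer below the chosen one, which an automatic-differentiation framework would handle; the toy only shows where the case distinction in Eq. (2) enters.
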

On top of the CNN responses in each layer of the network we built a style representation
that computes the correlations between the different filter responses, where the expectation is
taken over the spatial extent of the input image. These feature correlations are given by the
Gram matrix $G^l \in \mathbb{R}^{N_l \times N_l}$, where $G^l_{ij}$ is the inner product between the vectorised feature
maps $i$ and $j$ in layer $l$:

$$G^l_{ij} = \sum_k F^l_{ik} F^l_{jk} . \tag{3}$$
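
Eq. (3) amounts to a single matrix product. The NumPy sketch below (with the hypothetical helper name `gram_matrix`) also checks that shuffling the spatial positions, i.e. the columns of $F^l$, leaves the Gram matrix unchanged, which is one way to see why this representation discards the global arrangement of the image.

```python
import numpy as np


def gram_matrix(F):
    """Eq. (3): G[i, j] = sum_k F[i, k] * F[j, k] for an (N_l, M_l) matrix F."""
    F = np.asarray(F, dtype=float)
    return F @ F.T


F = np.array([[1.0, 2.0, 0.0],
              [3.0, 4.0, 1.0]])      # two toy feature maps, three positions
G = gram_matrix(F)

# Permuting the spatial positions (columns) does not change the Gram matrix.
shuffled = F[:, [2, 0, 1]]
print(G)
print(np.array_equal(G, gram_matrix(shuffled)))
```
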

To generate a texture that matches the style of a given image (Fig. 1, style reconstructions),
we use gradient descent from a white noise image to find another image that matches the style
representation of the original image. This is done by minimising the mean-squared distance
between the entries of the Gram matrix from the original image and the Gram matrix of the
image to be generated. So let $\vec{a}$ and $\vec{x}$ be the original image and the image that is generated and
$A^l$ and $G^l$ their respective style representations in layer $l$. The contribution of that layer to the
total loss is then

$$E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left( G^l_{ij} - A^l_{ij} \right)^2 \tag{4}$$

and the total loss is

$$\mathcal{L}_{style}(\vec{a}, \vec{x}) = \sum_{l=0}^{L} w_l E_l \tag{5}$$

where $w_l$ are weighting factors of the contribution of each layer to the total loss (see below for
specific values of $w_l$ in our results). The derivative of $E_l$ with respect to the activations in layer
$l$ can be computed analytically:

$$\frac{\partial E_l}{\partial F^l_{ij}} = \begin{cases} \frac{1}{N_l^2 M_l^2} \left( (F^l)^{\mathrm{T}} \left( G^l - A^l \right) \right)_{ji} & \text{if } F^l_{ij} > 0 \\ 0 & \text{if } F^l_{ij} < 0 . \end{cases} \tag{6}$$

The gradients of $E_l$ with respect to the activations in lower layers of the network can be readily
computed using standard error back-propagation. The five style reconstructions in Fig. 1 were
generated by matching the style representations on layer ‘conv1_1’ (a), ‘conv1_1’ and ‘conv2_1’
(b), ‘conv1_1’, ‘conv2_1’ and ‘conv3_1’ (c), ‘conv1_1’, ‘conv2_1’, ‘conv3_1’ and ‘conv4_1’ (d),
‘conv1_1’, ‘conv2_1’, ‘conv3_1’, ‘conv4_1’ and ‘conv5_1’ (e).
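
The per-layer style term, its weighted sum and the analytic derivative can be sketched as follows. This is a hedged NumPy illustration rather than the original implementation: the function names are hypothetical, the feature matrices are random stand-ins for CNN activations, and the derivative is taken with respect to the activations of one layer only (back-propagation to the image is omitted).

```python
import numpy as np


def gram_matrix(F):
    """Eq. (3): inner products between the rows (feature maps) of F."""
    return F @ F.T


def style_layer_loss(F, A):
    """Eq. (4): E_l for generated features F and target Gram matrix A."""
    n_filters, n_positions = F.shape
    diff = gram_matrix(F) - A
    return np.sum(diff ** 2) / (4.0 * n_filters ** 2 * n_positions ** 2)


def style_layer_grad(F, A):
    """Eq. (6): dE_l/dF, set to zero where the activation is not positive."""
    n_filters, n_positions = F.shape
    # ((F^T (G - A))_ji is the same number as ((G - A)^T F)_ij.
    grad = (gram_matrix(F) - A).T @ F / (n_filters ** 2 * n_positions ** 2)
    return np.where(F > 0, grad, 0.0)


def style_loss(features, targets, weights):
    """Eq. (5): weighted sum of the per-layer terms E_l."""
    return sum(w * style_layer_loss(F, A)
               for F, A, w in zip(features, targets, weights))


rng = np.random.default_rng(0)
F = rng.uniform(0.1, 1.0, size=(3, 5))                 # generated image, one layer
A = gram_matrix(rng.uniform(0.1, 1.0, size=(3, 5)))    # target style, same layer
print(style_layer_loss(F, A))
print(style_layer_grad(F, A).shape)
```

Comparing `style_layer_grad` against a finite-difference estimate of `style_layer_loss` is a cheap way to check that the reconstructed Eq. (6) is consistent with Eq. (4).
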

To generate the images that mix the content of a photograph with the style of a painting
(Fig. 2) we jointly minimise the distance of a white noise image from the content representation
of the photograph in one layer of the network and the style representation of the painting in a
number of layers of the CNN. So let $\vec{p}$ be the photograph and $\vec{a}$ be the artwork. The loss function
we minimise is

$$\mathcal{L}_{total}(\vec{p}, \vec{a}, \vec{x}) = \alpha \mathcal{L}_{content}(\vec{p}, \vec{x}) + \beta \mathcal{L}_{style}(\vec{a}, \vec{x}) \tag{7}$$

where $\alpha$ and $\beta$ are the weighting factors for content and style reconstruction respectively. For
the images shown in Fig. 2 we matched the content representation on layer ‘conv4_2’ and the
style representations on layers ‘conv1_1’, ‘conv2_1’, ‘conv3_1’, ‘conv4_1’ and ‘conv5_1’
($w_l = 1/5$ in those layers, $w_l = 0$ in all other layers). The ratio $\alpha/\beta$ was either
$1 \times 10^{-3}$ (Fig. 2 B,C,D) or $1 \times 10^{-4}$ (Fig. 2 E,F). Fig. 3 shows results for different relative
weightings of the content and style reconstruction loss (along the columns) and for matching the
style representations only on layer ‘conv1_1’ (A), ‘conv1_1’ and ‘conv2_1’ (B), ‘conv1_1’,
‘conv2_1’ and ‘conv3_1’ (C), ‘conv1_1’, ‘conv2_1’, ‘conv3_1’ and ‘conv4_1’ (D), ‘conv1_1’,
‘conv2_1’, ‘conv3_1’, ‘conv4_1’ and ‘conv5_1’ (E). The factor $w_l$ was always equal to one divided
by the number of active layers with a non-zero loss-weight $w_l$.
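
The weighting scheme described above is simple enough to write down directly. The sketch below is illustrative only: `style_layer_weights` and `total_loss` are hypothetical helper names, and the content and style terms passed to `total_loss` are placeholder numbers rather than values computed from a network. Because only the ratio $\alpha/\beta$ is reported, fixing `alpha = 1.0` and setting `beta = 1e3` is just one assumed way to obtain a ratio of $1 \times 10^{-3}$.

```python
STYLE_LAYERS = ("conv1_1", "conv2_1", "conv3_1", "conv4_1", "conv5_1")


def style_layer_weights(active_layers, all_layers=STYLE_LAYERS):
    """w_l = 1 / (number of active layers) on active layers, 0 elsewhere."""
    active = set(active_layers)
    unknown = active - set(all_layers)
    if not active or unknown:
        raise ValueError("active_layers must be a non-empty subset of all_layers")
    return {name: (1.0 / len(active) if name in active else 0.0)
            for name in all_layers}


def total_loss(content_term, style_term, alpha, beta):
    """Eq. (7): weighted sum of the content and style losses."""
    return alpha * content_term + beta * style_term


# Fig. 2 setting: all five style layers, so each weight is 1/5.
print(style_layer_weights(STYLE_LAYERS))

# Fig. 3, second row: only the two lowest style layers are active.
print(style_layer_weights(["conv1_1", "conv2_1"]))

# Placeholder loss values combined with alpha / beta = 1e-3.
print(total_loss(content_term=2.0, style_term=3.0, alpha=1.0, beta=1e3))
```
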